Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints
Learning robust local image feature matching is a fundamental low-level
vision task, which has been widely explored in the past few years. Recently,
detector-free local feature matchers based on transformers have shown promising
results, which largely outperform pure Convolutional Neural Network (CNN) based
ones. However, the correlations produced by transformer-based methods are
spatially limited to the centers of the source views' coarse patches because
of the cost of attention learning. In this work, we revisit this issue and
find that such a matching formulation degrades pose estimation, especially
for low-resolution images. We therefore propose a transformer-based cascade
matching model -- Cascade
feature Matching TRansformer (CasMTR), to efficiently learn dense feature
correlations, which allows us to choose more reliable matching pairs for the
relative pose estimation. Instead of re-training a new detector, we use a
simple yet effective Non-Maximum Suppression (NMS) post-process to filter
keypoints through the confidence map, and largely improve the matching
precision. CasMTR achieves state-of-the-art performance in indoor and outdoor
pose estimation as well as visual localization. Moreover, thorough ablations
show the efficacy of the proposed components and techniques.
Comment: Accepted by ICCV 2023. Code will be released at https://github.com/ewrfcas/CasMT
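The NMS post-process described above, which keeps only local maxima of the confidence map as keypoints, can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the window radius and confidence threshold are assumed values:

```python
import numpy as np

def nms_filter_keypoints(conf: np.ndarray, radius: int = 2, thresh: float = 0.5):
    """Keep pixels that are local maxima of `conf` within a
    (2*radius+1)^2 window and exceed `thresh`; return their
    (row, col) coordinates, sorted in row-major order."""
    padded = np.pad(conf, radius, mode="constant", constant_values=-np.inf)
    # Max of each pixel's neighbourhood via sliding windows (NumPy >= 1.20).
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, (2 * radius + 1, 2 * radius + 1))
    local_max = windows.max(axis=(2, 3))
    keep = (conf >= local_max) & (conf > thresh)
    return np.argwhere(keep)

conf = np.zeros((8, 8))
conf[2, 2], conf[2, 3], conf[6, 6] = 0.9, 0.8, 0.7
# keeps (2, 2) and (6, 6); the weaker neighbour (2, 3) is suppressed
print(nms_filter_keypoints(conf))
```

Because the suppression is a pure post-process on the confidence map, it can be applied to any matcher's output without re-training a detector, which is the point the abstract makes.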
Taking a look at small-scale pedestrians and occluded pedestrians
Small-scale pedestrian detection and occluded pedestrian detection are two challenging tasks. However, most state-of-the-art methods handle only a single task at a time, which leads to relatively poor performance when, in practice, the two tasks are required simultaneously. In this paper, we find that small-scale pedestrian detection and occluded pedestrian detection actually share a common problem, i.e., inaccurate localization. Solving this problem therefore improves the performance of both tasks. To this end, we pay more attention to predicted bounding boxes with poor localization precision and extract more contextual information around objects, for which two modules (i.e., location bootstrap and semantic transition) are proposed. The location bootstrap reweights the regression loss: the loss of a predicted bounding box far from its ground truth is upweighted, while the loss of a predicted bounding box near its ground truth is downweighted. Additionally, the semantic transition adds more contextual information and relieves the semantic inconsistency of the skip-layer fusion. Since the location bootstrap is not used at test time and the semantic transition is lightweight, the proposed method adds little extra computational cost during inference. Experiments on the challenging CityPersons and Caltech datasets show that the proposed method outperforms state-of-the-art methods on small-scale and occluded pedestrians (e.g., 5.20% and 4.73% improvements on Caltech).
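As a rough illustration of the location-bootstrap idea (reweighting the regression loss by localization quality), the sketch below uses IoU as the quality proxy and a simple linear weight; the exact weighting scheme in the paper may differ, so treat this as a minimal sketch under those assumptions:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def bootstrapped_regression_loss(preds, gts):
    """L1 regression loss where badly-located boxes (low IoU with their
    ground truth) are upweighted and well-located boxes are downweighted."""
    total = 0.0
    for p, g in zip(preds, gts):
        w = 2.0 - iou(p, g)  # weight in [1, 2]: lower IoU -> larger weight
        l1 = np.abs(np.asarray(p, float) - np.asarray(g, float)).mean()
        total += w * l1
    return total / len(preds)

gt = [[0, 0, 10, 10]]
near = [[1, 1, 10, 10]]  # well-located prediction
far = [[6, 6, 16, 16]]   # badly-located prediction
print(bootstrapped_regression_loss(near, gt) < bootstrapped_regression_loss(far, gt))  # True
```

Since the weight depends only on ground-truth boxes, this term exists only during training, which matches the abstract's claim that the bootstrap adds no cost at inference.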
Learning Prior Feature and Attention Enhanced Image Inpainting
Many recent inpainting works have achieved impressive results by leveraging
Deep Neural Networks (DNNs) to model various prior information for image
restoration. Unfortunately, the performance of these methods is largely limited
by the representation ability of vanilla Convolutional Neural Networks (CNNs)
backbones. On the other hand, Vision Transformers (ViT) with self-supervised
pre-training have shown great potential for many visual recognition and object
detection tasks. A natural question is whether the inpainting task can
greatly benefit from a ViT backbone. However, it is nontrivial to directly
swap in the new backbone in inpainting networks, as inpainting is an
inverse problem fundamentally different from recognition tasks. To this
end, this paper incorporates the pre-training based Masked AutoEncoder (MAE)
into the inpainting model, which enjoys richer informative priors to enhance
the inpainting process. Moreover, we propose to use attention priors from MAE
to make the inpainting model learn more long-distance dependencies between
masked and unmasked regions. Thorough ablations of the inpainting model and
the self-supervised pre-training models are discussed in this paper. In
addition, experiments on both Places2 and FFHQ demonstrate the effectiveness
of our proposed model. Code and pre-trained models are released at
https://github.com/ewrfcas/MAE-FAR.
Comment: ECCV 202
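One simple way to read "attention priors" is as an additive bias on the inpainting model's attention logits. The sketch below is a hypothetical illustration: the function names and the fusion rule `alpha * prior_logits` are assumptions, not the released MAE-FAR code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prior(q, k, v, prior_logits, alpha=1.0):
    """Scaled dot-product attention whose logits are biased by a prior
    attention map (e.g., taken from a pre-trained MAE); a larger alpha
    trusts the prior more, encouraging long-distance links between
    masked and unmasked regions."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + alpha * prior_logits
    return softmax(logits) @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))   # 6 tokens, head dim 4
prior = rng.normal(size=(6, 6))        # prior token-to-token affinities
out = attention_with_prior(q, k, v, prior)
print(out.shape)  # (6, 4)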